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ABSTRACT 

We present a neural net algorithm for parameter estimation in the context of 
large cosmological data sets. Cosmological data sets present a particular chal- 
lenge to pattern-recognition algorithms since the input patterns (galaxy redshift 
surveys, maps of cosmic microwave background anisotropy) are not fixed tem- 
plates overlaid with random noise, but rather are random realizations whose 
information content lies in the correlations between data points. We train a 
"committee" of neural nets to distinguish between Monte Carlo simulations at 
fixed parameter values. Sampling the trained networks using additional Monte 
Carlo simulations generated at intermediate parameter values allows accurate 
interpolation to parameter values for which the networks were never trained. 
The Monte Carlo samples automatically provide the probability distributions 
and truth tables required for either a frequentist or Bayseian analysis of the one 
observable sky. We demonstrate that neural networks provide unbiased parame- 
ter estimation with comparable precision as maximum-likelihood algorithms but 
significant computational savings. In the context of CMB anisotropies, the com- 
putational cost for parameter estimation via neural networks scales as N^^"^. The 
results are insensitive to the noise levels and sampling schemes typical of large 
cosmological data sets and provide a desirable tool for the new generation of 
large, complex data sets. 



Subject headings: methods: data analysis — (cosmology:) cosmic microwave 
background — (cosmology:) cosmological parameters 
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1. Introduction 

A fundamental question in cosmology is the origin and evolution of large scale structure 
in the universe. The standard model for this evolution is the gravitational growth and 
collapse of initially small perturbations in the primordial density distribution. This picture is 
supported by the detection of primordial Cosmic Microwave Background (CMB) temperature 
anisotropies at a level of approximately one part in 10^ by the Cosmic Background Explorer 
satellite and a series of ground-based and balloon-borne experiments. Small perturbations 
on the matter and energy density in the early universe are reflected in the temperature 
distribution of the CMB, providing a "snapshot" of conditions the early universe while the 
perturbations were still in the linear regime. 

One angular scale of particular interest is the horizon size at the surface of last scattering, 
the epoch when the universe cooled sufficiently to form neutral hydrogen and allow the CMB 
photons to propagate freely. Causally-connected regions at the surface of last scattering, as 
viewed from the present epoch, subtend an angle 

e ~ 1=7 fir (^)"' 

J- + Zis 

where zis is the redshift at last scattering and VLq is the total density of the universe relative 
to the critical (closure) density. Anisotropy on scales larger than ~ 2° reflect perturbations 
larger than the particle horizon and thus probe the primordial density distribution. On scales 
smaller than 2°, causal mechanisms become important and modify the primordial density in 
model-specific ways. 

Oscillations in the coupled photon-baryon fluid within the primordial potential wells 
prior to decoupling lead to signature oscillations ("acoustic peaks") in the angular power 
spectrum of the CMB (Peebles and Yu 1970). A precise determination of the power spec- 
trum in this regime can probe a wealth of information about the early universe. For instance, 
the angular scale of the first acoustic peak depends primarily on the angle subtended by the 
Jeans mass at last scattering (Hu and Sugiyama 1995). It is thus a direct probe of the large- 
scale geometry of the universe and hence the density parameter Qq- The width of the acous- 
tic peak depends on the sound speed at decoupling, and in turn on the baryon density and 
Hubble constant as Vti^h^ . A measurement of this quantity could be compared with similar 
determinations based on primordial nucleosynthesis (Copi et. al. 1995; Olive et. al. 2000). 
See e.g. (Hu 1999), for a review of how the different physical processes in the early uni- 
verse leave their imprint on the CMB anisotropies. Measurements of the CMB anisotropy 
on medium angular scales thus offer an elegant determination of such parameters as the 
curvature, density, matter density, baryon density, Hubble constant, cosmological constant, 
density perturbation index n, scalar/tensor ratio, and ionization history of the universe. 



-3- 



Prom the one observable sky, we want to infer the values of these cosmological parameters 
with minimal uncertainty in the shortest possible time. Theoretical models do not predict a 
specific template for the CMB anisotropy (a hot spot at this location, a cold spot over there), 
but rather predict a statistical distribution usually expressed in terms of the angular power 
spectrum. Deriving the power spectrum from the data (or more generally, deriving model 
parameters directly from the sky maps) involves accounting for angular correlations between 
pixels, precluding use of simple linear least-squares techniques. The accepted standard in the 
CMB community has been the generalization of least-squares techniques as implemented in 
maximum hkelihood algorithms (see, e.g., (Gorski et al. 1996; Tegmark, Taylor, & Heavens 
1997; Bond, Jaffe, & Knox 1998; Borrill 1999). 

The simplest method of parameter estimation uses a goodness-of-fit test to compare a 
set of observables yi ± Uj measured at a set of positions Xi to a theoretical model Fj. If we 
have Nd data points i/i and Np parameters pj, we define 



is function of the parameters p and some fixed basis functions X{x). We obtain the "best-fit' 
parameter values by minimizing with respect to the parameters, 

1^ = (3 

for the j^^ parameter pj. The least-squares system in Eq. 3 has the solution 




(1) 



where 




(2) 




(4) 



k=l 



where 




(5) 



is an Np x Np matrix, and 




(6) 



is a vector of length Np. 
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If the data points are not independent, this relatively simple calculation becomes much 
more costly. Covariance between the observed data points can result from instrumental 
artifacts (correlated noise, instrumental resolution, oversampled data) or from correlations 
in the underlying signal (for instance, measuring in real space a signal whose components 
are independent in Fourier space). Equation 1 can be generalized to include the effects of 
covariance, 

= E - ^i)i^~%iyj - r.), (7) 
i=i j=i 

where 

M,, = ((i/,-(r,))(%-(r,))) (8) 

is the Nd X covariance matrix and the brackets denote an ensemble average. 

Conjugate gradient techniques can solve for without expliciting solving for 
and thus avoiding the 0{N^) operations this would incur. But if the covariance matrix 
M depends on the parameters pj, then minimizing will produce biased estimates for pj. 
Maximum-likelihood parameter estimation provide a tool to overcome this limitation. For a 
multivariate Gaussian distribution, the probability of obtaining the observed data yi given 
a set of parameters pj is 

C = P{y\p) = (2vr)-^^/^ ^^^[^ (9) 

where is defined in Eq. 7. The "best" choice of parameters is that which maximizes the 
likelihood function C. The curvature of the likelihood surface about the maximum defines 
the uncertainty in the fitted parameters. 



%>V(F-%- (10) 

where 

is the Fisher information matrix (Kendall & Stuart 1969) and L — — log(£) (see Bunn & 
Sugiyama 1995; Vogeley & Szalay 1996; Tegmark et al. 1997; Bond et al. 1998). 

The maximum likelihood estimator is unbiased and asymptotically approaches the equal- 
ity in Eq. 10. However, these advantages come at a steep price: both the ^-iid the deter- 
minant calculation in Eq. 9 scale as 0{N^), making brute- force techniques computationally 
infeasible. For Nfi > 10^ the time required is measured in years, even on the most powerful 
supercomputers. A number of authors have suggested ways around this problem. Karhunen- 
Loeve eigenvalue techniques produce moderate data compression, reducing the original 
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data points to N' Nd/W eigenmodes (Bond 1994; Bunn & Sugiyama 1995; Tegmark et 
al. 1997). However, estimating cosmological parameters from the smaller set of eigenmodes 
still scales as {N'y operations, making such techniques undesirable for mega-pixel data sets. 

Oh et al. (1998) derive a method for hkelihood evaluation using a Newton-Raphson 
quadratic iteration scheme. This method utilizes a conjugate gradient algorithm to evaluate 
X^- The determinant is first approximated using azimuthal symmetry of the noise matrix 
(appropriate for full-sky CMB maps), then corrected using Monte Carlo simulations. The 
method provides a nearly minimum-variance estimate of the angular power spectrum for 
CMB anisotropy maps in 0{Nj) operations and 0{N'^^'^) storage; cosmological parameters 
can then be derived by comparing the power spectrum to various models. Although this 
algorithm is fast enough for mega-pixel data sets, it has several weaknesses. Since it is a 
Newton-Raphson iterative scheme, it requires a sufficiently good starting estimate to guar- 
antee convergence to true maximum. It is also optimized to estimate the power spectrum, 
rather than the underlying cosmological parameters - when used as a root-finding technique 
in parameter space, the radius of convergence is small and the problems associated with 
parameter covariance become severe. 

The last few years have seen the development of a number of techniques designed to 
overcome the iV| problem, most of them focusing on estimating the power spectrum. These 
techniques include Szapudi et. al. (2000), who have developed a method based on using the 
two-point correlation function. Dore et. al. (2001) relics on an hierarchical implementation 
of the usual quadratic estimator to the power spectrum. Both methods arc known to scale 
as N^, with the first hoping to be improved to N^logNd while the second may scale as A^^ 
(with a large prefactor). There is also the Monte Carlo estimator to the power spectrum 
developed by Hivon et. al. 2001, which has been used (Netterfield et. al. 2001) to analysis 
the BOOMERANG data (deBernadis et. al. 2000). All these approximate techniques rely 
on first estimating the power spectrum and then using this estimated power spectrum to 
determine the most likely cosmological parameters. Douspis et. al. (2001) have shown that 
the data sets are not fully described by just their band power sets. 

Borrill (1998) offers a global solution to bound the hkelihood. This method uses Gaus- 
sian quadrature to bound the likelihood at any point in parameter space (not just near the 
likelihood maximum); it is thus well suited to search parameter space using the minimum- 
variance direct pixel basis. However, the method requires 0{N^ ) operations for each like- 
lihood evaluation and is thus significantly slower than the method of Oh et al. (1998). More 
importantly, it can only provide bounds on the likelihood, fixing \og{C) to accuracy of a few 
percent. Since log(£) > (a large number), errors of a few percent can create significant 
bias in the location of the likelihood maximum. 
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The Microwave Anisotropy Probe (MAP), launched in the summer of 2001, will measure 
the full sky in 5 frequency bands with over a million pixels per band (Bennett et. al. 1995). 
The Planck Surveyor mission, scheduled later in the decade, will produce maps with over 
10'' pixels (Bersanclli et. al. 1996). Maps of such size can not be analyzed with brute-force 
maximum likelihood techniques. 

Figure 1 shows typical information flow for cosmological parameter estimation. We com- 
pare data (CMB sky maps) to a parameterized model using some mathematical "machinery" 
to derive a set of parameters describing the data. Although least squares methods and their 
generalization to maximum-likelihood techniques arc common choices for this machinery, 
they are not the only choices possible. For mega-pixel data sets, these deterministic meth- 
ods are computationally infeasible. Neural net algorithms provide an alternative machinery 
for parameter estimation in large, complex data sets. 

Neural networks have been used previously in astronomy for galaxy classification (Lahav et. al. 1996; 
Andreon et. al. 2000) and periodicity analysis of unevenly sampled data as applied to stellar 
light curves (Taghaferri et. al. 2000). They have also been used to analyze stellar spectra 
(Bailer- Jones et. al. 1997; Bailer- Jones et. al. 1998; Bailer- Jones 2000), with results com- 
parable to traditional methods. However, Bailer- Jones et. al. compared data to a determin- 
istic model (stellar spectra), whereas cosmological applications examine random patterns 
drawn from parameterized stochastic models. We demonstrate the generality of our algo- 
rithm by considering different problems with the same network architecture. 

We use neural networks as a complementary approach to cosmological parameter esti- 
mation in large data sets. They arc a forward algorithm in that they "learn" the differences 
between two different parameter data points by being trained on sets of simulated data sets 
drawn at the specified parameter points. We find the computational cost for training the 
network, in the context of CMB anisotropy, requires 0(A^'^/^) operations and thus provides 
a substantial improvement over brute-force maximum-likelihood methods. 

Neural networks do not require that we specify one single statistic of a priori interest. 
As the network is trained, it determines how it will discriminate. The information required to 
separate different parameter points comes from the training set simulations. After selecting a 
pair of parameter points, a committee of networks are trained on simulations taken at each of 
the parameter points. We interpolate by sampling the trained networks with simulated data 
drawn at parameter points between the training sets. This takes advantage of the fact that 
for large input patterns, the number of sampling sets will be much less than the number of 
training passes. The algorithm naturally provides the distributions necessary to understand 
the confidence levels of the parameter fit when the physical data is analyzed. This sort of 
sampling of trained networks becomes a key component in utilizing a Bayesian approach 
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to parameter estimation, see (Christensen and Meyer 2000; Rocha 2000) for discussions of 
the Bayesian approach in the context CMB anisotropics and (MacKay 1995), along with 
(Bishop 1995), for neural networks. 

We present a brief review of the neural network algorithm and discuss how this model 
is used for parameter estimation. To demonstrate the feasibility of this algorithm, we show 
how we can estimate parameters in three different examples. The same neural network and 
parameter estimation algorithm is used in all the examples. Using these examples, we vary 
the size and noise characteristics of the simulated data sets to determine the computational 
scaling. 

2. Neural Network Algorithms 

Parameter estimation can be thought of as an exercise in pattern recognition, for which 
an extensive literature exists (see, e.g., McCulloch and Pitts 1943; Rosenblatt 1962; Watan- 
abe 1969; Hopfield 1982; Rumelhart, Hinton, & McClellend 1986; Grossbcrg 1988). Neural 
networks, an implementation of parallel distributed processing architecture, are well suited 
to pattern recognition. Although not widely recognized as such, neural networks are also 
well suited for the problem of parameter estimation in large, complex data sets. 

The basic building block of a neural network is the neuron, consisting of any number of 
inputs and a single output. A neuron sums its inputs to determine its output via some (non- 
linear) transfer function. A neural network consists of layers of neurons connected together 
via matrices of weights (Figure 2). A network is trained to produce a desired output by 
repeatedly presenting the network with a set of known input patterns, then adjusting the 
weights until the network produces the desired output for each known input. When the 
network is later presented with an unknown input, the output will reflect which of the 
known inputs the unknown input most closely resembles. 

Figure 2 shows a typical network architecture. The first layer in a network is its input 
layer, whose outputs reflect the pattern presented to the network. Any number of hidden 
layers lie between the input and output layers, from which the results of the network are 
read. We use one hidden layer and a single output unit. We thus have one matrix of weights 
connecting the input pattern to the inputs of the hidden layer and another matrix connecting 
the outputs of the hidden layer to the inputs of the output unit. 

The information content of a neural network lies in the matrices of weights connecting 
layers (analogous to the information content of the covariance matrix for maximum likelihood 
algorithms). The weights are set during the training process, in which we present the network 
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with a series of known inputs and adjust the weights to obtain the desired outputs. For a 
network with A^^ input units, we let Xpatt, a element vector, represent an input pattern. 
The output of the network can be viewed as a function of the input pattern: o = o(Xpatt)- 
For a given pattern we associate an output target tpatt (in our case a single value, but for a 
more general network, this would be a A^^output vector). A training algorithm is a method for 
adjusting the weights to minimize the total error E = "^^^^^ -^^patt; where 

1 2 
-^'patt = 2 (o(Xpatt) ~ ^patt) (12) 

We use the back propagation method (Rumelhart, D.E., Hinton, G.E., and WiUiams, R.J. 
1986). According to this method, we present a series of A^ivam patterns to the network, one 
at a time. For each pattern, we compare the output with the associated target and determine 
the error. We use this error to adjust the weights connecting the hidden layer to the output 
layer, and then back-propagate it to the weights between the hidden and input layers. Note 
that the training depends on the difference between the desired and actual outputs - no a 
priori statistical test for the inputs is specified. 

Neural nets achieve computational savings by training with simulated data sets, as 
opposed to inverting a large matrix. The computational cost to train a neural net is found 
to be 

A^cpu oc N^, (13) 

where the exponent a depends on the particular problem. There are two main contributions 
to this cost: the array multiplication needed to sum the weights to evaluate a single pattern, 
times the number of training patterns required. In general we find 1 < a < 3, depending 
on the specific problem. This is never worse than standard maximum likelihood methods, 
and at best the computational cost grows linearly with the input pattern size. For the 
specific case of parameter estimation in mega-pixel CMB maps, the computational cost 
scales as A'cpu c< A"^'^. Neural networks provide significant computational savings over 
both standard maximum-likelihood algorithms and the A^^ scaling for the fastest known 
approximate method. 



3. Neural Networks and Parameter Estimation 

Neural networks can be trained to estimate the value of a continuous parameter, and can 
reliably interpolate to parameter values intermediate between the training values. Though 
this idea is not new to astrophysics, e.g. (Bailer- Jones et. al. 1997; Bailer-Jones et. al. 1998; 
Bailer- Jones 2000), our method is fundamentally different from that of Bailer- Jones et. al. 
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in that our patterns are random samples drawn from a parameterized parent population. 
The Bailer- Jones et. al. method is a matter of template matching; randomness only enters 
their input patterns as instrument noise. For CMB and other cosmological data, the patterns 
themselves are intrinsically random. Nonetheless, using the same basic neural network archi- 
tecture, we can train the networks to discriminate stochastic patterns that differ according 
to the parent population from which they are drawn. 

We start by training a network to differentiate between simulated data sets (including 
instrument noise and other artifacts) generated at a pair of discrete parameter values. The 
back-propagation adjusts the weights until the network outputs target value when presented 
with the first set of patterns, and target value 1 when presented with the second. In practice, 
since the information distinguishing different parameter values is in the correlations, not the 
actual pixel values, any single network will not train to a sufficient degree. We improve the 
situation by training a small committee of networks and polling them to get a consensus 
opinion. Now we find simulations generated at the training parameter values produce two 
well defined peaks. Simulations generated with an intermediate parameter value (never 
present in training data) yield outputs peaking somewhere in between, depending on whether 
the new parameter value is closer to the first or second training value (see Fig 3). 

We quantify this behavior by presenting the trained network with a set of new inputs 
drawn from a grid of intermediate parameter values, and derive for each intermediate param- 
eter vahic the corresponding probability distribution of output values. When the networks 
are later presented with an unknown pattern, each distribution gives the probability that the 
unknown pattern was generated with a parameter value corresponding to that grid point. 
The interpolated parameter is the probability-weighted mean. The grid samples all use the 
same trained networks; the sampling of the networks at different grid points is faster than 
the training since we usually need many more training sets than sampling sets, thus no great 
computational cost is incurred. Although we focus below on estimating a single parameter, 
the method is readily extended to multi-parameter fits. 

The simulations are viewed as random variables X(p) taken from an underlying param- 
eterized probability distribution, with p the parameter. Assuming the data lies between two 
extreme values p^^^ and p^^^ we train a network to the target t^'^^ = for realizations X(p(°)) 
and t^^^ = 1 for X(p*^^)) by repeatedly presenting the network with new samples at p^^^ and 
p^^^ and back propagating the error. Once trained, the network will give an output o (X) 
between and 1 for any input parameter value. If p < p^^\ the output clips at 0, while if 
p > p^^^ the output clips at 1. 

Once we have trained a network, we can present additional, statistically independent 
samples drawn at p^^^ and p^^\ Figure 3a shows the output distributions for 1000 patterns 
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of each parameter. We see the network has successfully trained in that the distribution 
is peaked at o = while the p^^^ samples at o = 1. To be able to interpolate to intermediate 
values, we will need to present samples drawn at intermediate parameter values. Fig 3b 
shows the output distributions for an additional 1000 samples each for two parameters, one 
just a little larger than and one a little smaller than p'^^\ These distribution also show 
the same tendency to peak close to the limits of and 1, but not as strongly as those drawn 
at the trained parameters. In effect, the network is chosing which of the training parameters 
these new patterns, for which it was never trained, most closely resemble. At it stands now, 
this tendency makes it hard to construct the probability distributions we need for parameter 
estimation. 

By using a committee of networks, we take advantage of this peaking tendency. We 
want to determine the committee consensus and from this get the distributions we seek. 
The first step is converting the continuous output value into a discrete truth values or 1. 
For each trained network, we associate a midpoint value Omid and for any input pattern X, 
we define its truth value according to 

t.t(x) = |": . (14) 

1; 0(X) > Omid 

We interpret t(X) = as indicating the pattern was drawn from the parent population with 
parameter p^^\ and similarly we associate t = 1 with p^^\ 

To determine Omid, we present samples drawn at p^^^ and at p^^\ For each of these 
sets and any Omidi we obtain the truth values t^-^^ and t^p, i = 1, . . . ,N. With this, n'^'^^ = 
^^(1 — t^"^) is the number patterns drawn at p^^'' correctly identified as drawn at and 
n*^^) = tj-^"* similarly at p^^\ We chose Omid to maximize fc = 2f(^*'°'* ^*'^'*) ^^'^ refer to 
fc as the fraction correct, our main measure of how well a network has trained. In Fig 3a, 
Omid is marked with the vertical line and we find fc — 94%. 

We note fc allows us to determine the optimal number of training passes, A^xrain- Start- 
ing with an initially randomized network, as the network trains, we intermittently pause the 
training and sample the network to determine fc- It steadily increases to a maximum value 
and then levels out: the minimum for the training error E has been reached. We take A^rrain 
to be just where this plateau starts. 

For each network, any given input pattern is converted into discrete truth values t. We 
now form a committee of such networks, where the only difference between the networks 
is the initial randomization of the weights. We find committee sizes A^net ~ 50 sufficient. 
After presenting any given pattern X to the committee, we have the collection of truth 
values t(K)yn, m — 1, . . . , A^net- We view each as the vote from network m as to whether 



-li- 



the pattern resembled those drawn at p^'^) or p^^^ . The committee consensus is formed by 
generating the average truth value 

. Nnct 

"^•^^ m=l 

Figure 3c shows the distribution of average truth values for the same set of samples drawn at 
p^'^^ and p^^^ used in Fig 3a. They are now even more sharply peaked about t = and t = 1 
and in terms of the average truth value, fc = 100%. More important are the distributions 
displayed in Fig 3d. These are for the same intermediate samples as used in Fig 3b; we have 
now two well defined distributions with peaks intermediate to the peaks for the p^^^ and p^^^ 
samples. This is the general trend when we work in terms of the average truth value: as the 
parameter p is swept from to p^^\ we get well defined distributions whose peak moves 
from t ~ to t ~ 1. 

We utilize this behavior to have our networks interpolate parameter values for which it 
was never trained. We need a function that maps an average truth value to an estimated 
parameter: t i— > p(t). We built this function by sampling the committee of networks with 
samples drawn at parameter values intermediate to the training values (computationally a 
small cost compared with training a network). These in turn arc used to determine the 
optimal function p(t). The results of such sampling also automatically provide the distri- 
butions needed for either a frequentist's or Bayesian analysis. Selecting K parameter values 
uniformly distributed between the training values, p^' = p^^\ p^*^^ + Ap, p^^^ + 2Ap, . . . ,p^^\ 
we generate N samples at each of these parameter values. The samples are presented to the 
committee of networks and thus for each sample Xj drawn at each parameter value p'', its 
average truth value i'^ is computed. We determine the parameter estimation function p(t) 
by minimizing the distances between estimated and true parameter values, i.e. the error 



The t's will take on discrete values, t = 0, 1/A^nct, 2/A^net, and thus so will the pa- 

rameter estimation function: p(t) = pj when t = j/A^nct- Let Uj be the number of samples 

drawn at p'^ with = j/A^net- Then we can re- write our error as Ep = ^ Sit j (Pi — P^Y ■ 
Minimizing this yields 



1^.0 ^ P,^^. (17) 

i.e., each binned value of the parameter estimation function is the parameter weighted aver- 
age of the number of patterns of each sample that fall into that bin. Figures 3c and 3d are 
examples the distributions. 
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Given an observed data set, represented as the input pattern Xobs, we present it to 
the committee of networks and obtain its average truth value tobs- From this, we get the 
estimated parameter value pobs- We may now generate a set of samples at this parameter 
value, {Xj(pobs)}) and estimate their parameters, pj,fit- The distribution of the pj.fit's provides 
a self-consistent method for converging to the final estimation of the parameter. We begin 
with a wide separation between training values in an attempt to bracket the unknown input. 
We know the initial range is too narrow if either pobs ~ P^^^ or ~ p^-^\ that is, if the 
output clips at either end of its trained range. The distribution of the Pi,fit's also tells us 
when we have converged on final answer. We compare the width in the fitted parameter 
distribution to the separation between training values, and select new (closer) values. We 
intcratc this refinement of and p^^^ until the mean pfit doesn't change or the width of the 
fitted distribution matches the separation of training parameters. In practice convergence 
typically requires only two or three iterations. 

We illustrate how we determine the confidence levels of our parameter estimation via 
a Bayesian analysis of the results. This is not the only way to proceed; if one preferred, a 
frequentist's approach could be taken, similar to the above paragraph. From the Bayesian 
viewpoint, we are interested in posterior distribution V{p\pohs)' given the observed/estimated 
parameter pobs, what is the probability the true parameter is p. With this, we view the 
uncertainty in the estimated parameter as 



the width of true parameter values around Pobs that have a reasonable probability of being 
identified as Pobs- We can also use this determine the confidence levels. We use Bayes 
Theorem to express this in terms of the prior distributions: 



All these priors are readily available in the analysis we have developed so far. If there are 
the same number of patterns in each of the K sample sets, then we have a uniform prior 
on the parameters p'^ we sampled: V{p^) = 1/K. The probability of getting any particular 
Pobs is proportional to the total number of t's that have the corresponding value and hence 
^(Pobs) — X^fc 'T'obs- Given the true input parameter value is p^, the probability that it is 
identified as pobs is proportional to the number of samples with an average truth value that 
corresponds to pobs^ 'Pi'Pohsl'P^) = ""-jLobs/^- Thus, the probability distribution, Eqn. 19, 
is 'P(p'^IPobs) = ""-obs/ Xlfe ^obs- "^^^ resulting expression for the mean true parameter is the 
same as Eqn. 17, our optimal choice for defining the parameter estimation function. The 




(18) 



^(P I Pobs) = ^(Pobslp) 



(19) 
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68% confidence width for our parameter estimation becomes 

^^(Pobs) = ^'^'^y^^ - (p.,,)2 (20) 

2^ k '^obs 

It is important to note all the information used to determine this uncertainty in the fit was 
already generated during the determination of p(t) . 

We can use this Bayesian analysis to supplement our convergence criterion outlined 
above. The posterior probabihty P(p|pobs) is zero for p < p^^^ or p > p^^\ since there 
are no samples drawn outside the training range. Thus the uncertainty will be artificially 
suppressed close the to these limits. If pobs is within a{pohs) of either or p^^\ we need 
to select new training limits, i.e., wc have encountered the clipping previously discussed. If 
we are clear of either training limit and o'(pobs) is much less than the width of the training 
interval, we should choose new, closer, training limits. 

Whether we chose to do a frequentist's or Bayesian analysis, this method of using neural 
networks for parameter estimation benefits from not requiring the a priori definition of a 
statistic or goodness-of-fit function. We only need to be able to simulate the model, often a 
computational inexpensive task. Neural networks are ideally suited to working with models 
where the crucial information is in the phase information, e.g. topology of the Universe or 
tests for non-Gaussianity, along with being a faster algorithm for conventional parameter 
estimation problems. 



4. Examples 

To illustrate this algorithm, we analyze three examples. The first is the estimation of 
the frequency of an irregularly sampled noisy sine wave. This serves as a straightforward 
example of our method and shows how the neural network can recover a parameter value 
at which it was never trained. Next, we fit the spectral index for CMB anisotropics based 
on a Gaussian model, an example of the type of problem that will need accurate methods 
with low computational cost. The final example is parameter estimation in the context of 
large scale structure surveys, another arena with large data sets. This example illustrates 
the strength of the neural network to not need an a priori statistic. 



4.1. Irregularly Sampled Noisy Sine Wave 



A common problem in astrophysics is the estimation of periodic signals in noisy, irreg- 
ularly sampled data sets. Observing limits often produce data cutouts in time series; the 
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resulting aliasing of power then precludes simple Fourier transforms. Although other tech- 
niques such as periodograms exist for this problem, time series provide a simple introduction 
to the capabilities of neural nets. Our first example uses a time series consisting of — 256 
points, 

Xi{uj) = V2As sin (27ra; U + (J)) + rii (21) 

where ranges from to tf. We restrict data coverage by including 10 uniformly spaced 
gaps of 10 pixels each, plus an additional 24 randomly-placed gaps (Fig 4a). is a random 
phase and rii is Gaussian random noise of zero mean and unit variance. The factor of \/2 
gives signal-to- noise, S/N, of Ag for each observation (Fig 4b). 

This sine wave model is somewhere between a template and a random pattern. If 
= const., then this would be a matter of template matching, since but for the noise, the 
values of each of the pixels would be completely determined for each frequency. This is akin 
to the classification of stellar spectra (Bailer- Jones et. al. 1997; Bailer- Jones et. al. 1998; 
Bailer- Jones 2000). We are allowing to be a random variable and thus the location the 
peaks and valUeys of the sine wave vary from pattern to pattern. Our model has character- 
istics of a random pattern: the information about the frequency is now in the correlations 
of the pixel values. 

We train 50 networks using initial parameter range a;^'^^ = 4.0 and o;^^^ = 8.0 to produce 
an output value of for noisy sine waves with a; = 4.0 and output value 1 for a; = 8.0. We 
then sample the trained networks at intermediate values of uj to generate nine probability 
distributions V{o;u!i), cui — 4.0, 4.5, . . . , 8.0. Finally, we present the networks with 1000 
realizations at frequency cUjn = 5.60. Figure 4c shows the distribution of recovered Uoiys from 
these realizations. The mean value is intermediate between the training inputs, with no 
output close to the training values of 4.0 or 8.0. The distribution shows most of the estimated 
parameters lie between a;*^°^ = 4.5 and cu^^^ = 6.5, suggesting we repeat the analysis with 
these new training values. Fig. 4d shows the result after a second iteration. The peak of 
the distribution of estimated parameter values has shifted somewhat from from the previous 
iteration. Selecting new training values = 4.6 and u;^^^ — 5.8 to encompass the bulk 
of the estimated parameters, we do one more iteration. The peak for this distribution of 
estimated parameters has not shifted from the previous set of networks and the width of the 
distribution matches the range of sampled frequencies. The distribution peaks at a; = a;in, 
with fitted mean cUohs = 5.56 ± 0.11. 

Note that we correctly recover despite never training any network at any stage to this 
value. Neural networks provide an unbiased estimator for a continuous parameter without 
requiring prior knowledge of the parameter value. 



- 15 - 



4.2. Microwave Background Anisotropy 

Neural networks can also be applied to more challenging computational tasks. Deriving 
cosmological parameters from maps of the cosmic microwave background usually involves 
maximum likelihood algorithm whose computational cost (N^) makes them prohibitive for 
mega-pixel maps. The same neural net topology used for our time series example provides 
accurate, rapid parameter estimation for CMB maps as well. 

On angular scales 9 > 2°, anisotropy in the cosmic microwave background corresponds 
to primordial density perturbations with scale-free power spectrum Pj. oc A;", where k is the 
wavenumber and n is the power-law index. We simulate the COBE-DMR full-sky maps of the 
CMB anisotropy parameterized by the index n (Bond and Efstathiou 1987). We smooth full- 
sky maps with a Gaussian profile with a FWHM of 7° to include the effect of the radiometer 
horn profile and add Gaussian noise with an variance of (20 mK)^/A^obs,i to each pixel to 
account for the instrument noise (Bennett et. al. 1996). 

Foreground emission from our Galaxy dominates the GOBE data near the Galactic 
plane, rendering it unusable for cosmological analyses. We use the galaxy cut template 
of (Banday et. al. 1997) to excise pixels with significant Galactic emission. The cut sky 
represents an additional challenge for standard maximum-likelihood analyses. In the absence 
of this cut, the data sets represent full-sky coverage and can be decomposed in terms of 
orthogonal spherical harmonics. The resulting coefficients yield the power spectrum of the 
GMB and hence the spectral index n. Once the galaxy cut is imposed, the spherical harmonic 
functions are no longer orthogonal on the remaining pixels. Any attempt to to obtain a 
harmonic expansion will result in the aliasing of power between modes and an inaccurate 
power spectrum. Though a new orthogonal set of basis functions can be computed for 
the cut sky (Gorski, 1994), this is an problem as well. Since neural networks estimate 
cosmological parameters using the real-space pixel values, they need not take the detour 
through the power spectrum, and do not suffer aliasing of power. We simply impose the cut 
for galactic emission and train each network using only the remaining high-latitude pixels. 
As the network is trained , it automatically learns the effect of cuts in the data, without 
requiring any symmetries in the cut data (see e.g. (Oh et. al. 1999)). 

We generate simulated CORE maps using sky pixels of size 2.5° x 2.5° each. After 
the galaxy cut, this leaves an input pattern of = 3881 pixels. We use A^Hidden = 600 
and ATrain = 12000 (determined to be the optimal choice) and train 50 networks over the 
parameter range = 0.0 and n^^^ — 2.0. Figure 5 plots the probability distribution 
of nobs for a set of 1000 samples of riin — 1.40 (dotted line). This distribution matches 
with separation of the training sets and we need this only training iteration. We recover 
^obs — 1-30, with 68% of the samples between 1.07 and 1.51. In terms of our Bayesian 
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analysis, for n ~ 1, we determine a — 0.35, in agreement with the traditional maximum 
hkehhood analysis of the COBE-DMR 2 year data (Gorski el. al. 1994). 

Note that the neural network algorithm recovered the correct spectral index even though 
none of the networks used were trained at this value. The uncertainty, derived from the width 
of the probability distribution of riobs, is comparable to the value predicted by the maximum 
likelihood method. Neural networks can recover cosmological parameters from CMB data 
sets with comparable precision as maximum likelihood techniques, but using A^^'^ calculations 
instead of N^. 



4.3. Large-Scale Structure 

Large-scale structure provides a third example that combines parameter estimation 
and phase features. Redshift surveys such as the 2-Degree Field and the Sloan Digital Sky 
Survey measure the redshift and position on the sky of a large number of galaxies (A^ ~ 10^), 
sampling the quasi-linear regime ~ 100h~^ Mpc where h is the Hubble constant in units 
100 km s^^ Mpc~^. The observed redshift is the sum of the Hubble flow and the peculiar 
velocity induced by gravitational acceleration in the evolving density field. Coherent flows 
on large scales produce artifacts in the redshift distribution compared to real space. Galaxies 
on the far side of an overdensity tend to flow toward the center (hence toward the observer) 
so that their peculiar velocities subtract from the Hubble flow, making them appear closer 
than they really are. Galaxies on the near side move the opposite direction, so their peculiar 
velocities add to the Hubble flow. The net result is an apparent enhancement in the galaxy 
density in redshift space on scales of superclusters, compressing the region along the line 
of sight to the observer. The amplitude of this "bull's-eye" effect depends on the matter 
density flm on scales comparable to superclusters of galaxies and can be used to determine 

in model-independent fashion (Praton et. al. 1997; Melott et. al. 1998). 

Estimating from distortions in redshift space has several problems in practice. The 
first is defining a statistic to quantify the bull's-eye enhancement in concentric rings about 
the origin. (Melott et. al. 1998) use a large number of simulations to develop an empirical 
statistic defined as the ratio of rms spacing between upcrossings in isodensity contours in the 
redshift (radial) direction to that in the orthogonal (azimuthal) direction. It is thus a local 
statistic in that it compares high-density regions only to other nearby regions, and operates 
only on a single slice of redshift space after smoothing and contouring. 

Neural nets, by contrast, offer a global test by comparing each region of the density 
field to all other regions simultaneously, and can easily be extended across the entire three- 
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dimensional survey. No a priori statistic need be identified, nor do neural nets require 
contouring of tlie density field, thus avoiding the need to "fine-tune' the selection of contour 
levels. Figure 6 shows a toy model of the bullseye effect for several samples differing in the 
amplitude of the radial density enhancements. Our toy model consists of using a Bcssel 
function to modify an otherwise uniform distribution of galaxies, with the amplitude deter- 
mining the amount of modulation. The inputs to the neural network are density functions, 
one for each realization. Neural nets, trained to realizations differing only slightly in the 
amount of modulation, correctly discriminate between them. We do not specify how to tell 
them apart, but nonetheless, the correct amphtudes are recovered 90% of the time. Note 
that we test the bullseye effect using the same neural nets developed for the noisy sine wave 
and CMB tests above, simply trained to a different set of reference models. Neural nets are 
a powerful, versatile tool for a variety of problems in astrophysics. 



5. Computational Cost Scaling 

Neural nets have many desirable characteristics for parameter estimation with mega- 
pixel CMB maps. They operate globally on the data and return unbiased estimates of the 
underlying parameter values. They automatically account for data gaps, instrument noise, 
and other features peculiar to a particular data set. Most importantly, the computational 
costs are low enough to allow extension to the mega-pixel data sets expected in the near 
future. 

The dominant contribution to the CPU cost is the array multiplication associated with 
the weights. This multiplication is performed twice per training pass: once for evaluating 
the pattern, then again for back propagating the error. Each operation scales as the number 
of weights. Our total computational cost for training a network is 

A^CPU = 2 A^TVainA^Hidden {Nd + 1) (22) 

How this cost scales with N^, depends on the problem being considered. We find that Nqpu ~ 
N"' with 1 < a < 2.5. For the particular application of mega-pixel CMB maps, Nqpu ~ 

''Input • 

The computational cost depends on how the information content of the input pattern 
changes as more data points are added. We consider two limiting cases. The first case has 
fixed signal to noise per data point, so that the signal to noise ratio of the entire data set 
improves as more points are added. This is analogous to measuring a signal with a noisy 
detector, and simply adding more observations. The second case keeps the signal to noise 
ratio of the entire data set fixed, so that the signal to noise ratio per pixel gets worse as 



-18- 



more pixels are added. This is analogous to over-sampling a signal within a limited observing 
time: as the observations are broken into finer and finer resolution, the observing time per 
point decreases and the noise per observation gets worse. 

We treat these limiting cases using our noisy sine model. For the first case (fixed noise 
per observation), we simply simulate more data over additional periods of the sine wave, 
tf grows with A^^^ (maintaining both the pattern of regular cutouts and the percentage of 
random cutouts). The noise per pixel is fixed at al^"^^^ = 1^ independent of N^. For the 
second case (fixed noise averaged over the entire data set), we keep the number of periods 
per data set fixed, tf = 1, but allow the noise amplitude to vary as cr"™'^'^ = Co/ \/ N°^^, and 
fix ao so af'^" = 1 when = 256. 

The number of training passes and the number of hidden units depend on problem 
considered and the number of input pixels. For each case above, we determine which values 
give the best iVcpu for a fixed training accuracy. For a range of A^Hidden, we train multiple 
networks on a large set of patterns. As they are being trained, we monitor their ability to 
correctly classify independent samples. Once this ability passes a pre-set threshold, we know 
A^Train for ©ach uctwork, and hence A^cpu- We repeat for multiple network to estimate the 
uncertainty in the CPU cost. (We may also vary other training parameters to assure we 
have the optimal results.) The results arc not strongly dependent on the precise value of 
^Hidden- Abovc a minimum value, as A^Hiddcn increases, the number of training passes needed 
decreases in such a way that A^^cpu remains constant over a wide range of A^Hiddcn- For these 
limiting cases, we find Nqptj ~ for fixed noise per pixel, and iVcpu ^ for fixed noise 
per data set (Figure 7(a)). 

The computational costs for CMB analyses lie between these extremes. We derive the 
scaling for CMB maps by analyzing the CPU costs as progressively larger and larger areas 
of the sky arc covered. In this way, new information is introduced into the data sets as the 
patch size increases. The S/N ratio per pixel is fixed in this scheme, reflecting the trend in 
current experiments of scanning ever larger portions of the sky at (roughly) constant S/N 
ratio per pixel. 

We select circular patches of sky centered at the north zenith. The range of patch sizes 
are chosen to cover 1.5 orders of magnitude in TV^. For each patch size, we proceed as above 
to determine the optimal CPU cost. The results are shown in Figure 7(b). The compu- 
tational cost for training a CMB network scales according to A^cpu ~ ^d^: ^ considerable 
improvement over the scaling behavior for a maximum likelihood analysis. 
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6. Discussion 

We have shown neural networks can be used as a tool for astrophysical parameter esti- 
mation. For specificity, we have worked in a cosmological context (CMB maps and redshift 
surveys) where the stochastic nature of the problem is fundamental. The results are insensi- 
tive to noise levels and sampling schemes typical of large astrophysical data sets and provide 
parameter estimation comparable to maximum likelihood techniques. The computational 
cost is never greater than standard maximum likelihood techniques and in the context for 
CMB anisotropy maps, we find A^cpu = 0{Nj-^). 

If we classify parameter estimation techniques as to whether they are forward or reverse 
algorithms, we see the real strength of neural networks. Maximum-likelihood methods are an 
example of reverse algorithms. They start with the statistic under consideration and work 
backwards, inverting a covariance matrix, to the likelihood function used to compare different 
parameter choices. Forward algorithms provide a way to avoid the high computational costs 
of inverse methods. Typically, it is much simpler to generate model predictions at each 
sampled point in parameter space than to compute the matrix inverse and determinant 
required for maximum likelihood techniques. Forward algorithms trade many realizations of 
synthetic data sets computed at specific parameter values for the computationally infeasible 
matrix inversion. Neural networks are such an algorithm; synthetic data sets are used to 
both train and sample the networks. This gives us our speed improvement. 

Since either maximum likelihood or neural networks can be viewed as the "machinery" 
for parameter estimation, the fundamental information fiow stays the same (see figure 1). 
The statistical confidence levels for the fitted parameters are always accessible. When the 
"machinery" is sampled with independent synthetic data, we can determine the probabilities 
for making correct or incorrect parameter identifications. Such sampling also gives us direct 
access to the statistical power (Phillips and Kogut 2001). While training, the information 
distinguishing the different parameters is encoded in the weights. Interperting the resulting 
weight matrices is not usually possible (as compared to the Fisher matrix, Eqn 11). Using 
independent sampling of the network to derive the probability distributions needed for, e.g. 
Bayesian analysis, means we do not need direct access to the information in the weight 
matrices. 

A limitation of maximum likelihood methods is their requirement of an a priori defi- 
nition of a goodness-of-fit function. The choice of a goodness-of-fit function is not always 
obvious and is particularly acute for 2D and 3D surveys. Much of the information lies 
in the phase features of these surveys. Statistical tests can fail badly in detecting phase 
features, as witness the large hterature devoted to the relatively simple problem of edge 
detection in 2D data sets (see, e.g., (Hough 1962; Davis 1975; Canny 1986)). Topological 
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tests such as the genus or other Minkowski functional have been apphed to 2- and 3-D 
maps, (Gott et. al. 1990; Kogut et. al. 1996) but the relative power of these statistics is 
poor (Philhps and Kogut 2001). Neural networks, in contrast, do not require specification 
of a single statistic of a priori interest. As the network is trained, it determines how it 
will discriminate between competing models. The information required to separate differ- 
ent parameter values comes from the training set simulations. Neural networks thus offer a 
promising approach to cosmological parameter estimation, where the statistical properties 
of the primordial matter and energy distribution provide one of the few falsifiable tests of 
the standard inflationary paradigm. 
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Fig. 1. — Our basic model for parameter estimation from a data set. We have to a priori 
assume a model to compare the observed data set against; what we pay attention to is 
the machinery for performing this comparison. Maximum likelihood methods, the de facto 
standard in the CMB community must assume a model, just as must be done with currently 
proposed neural network method. 
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Input 
Layer 

Fig. 2. — Neural network architecture used for parameter estimation. The data (noisy sine 
wave, CMB maps or galaxy survey) are the input pattern and the neural network feeds 
forward through the hidden layer to the output unit. The output value is used to derive the 
probability distributions for the samples and estimated parameter for the observed data set. 
Typically there are fewer hidden units than input units. 
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Fig. 3. — Sample neural network output distributions, (a) Solid line is the distribution of 
output values for an independent set of samples drawn at the same parameter value for 
which the network was trained to the target 0. The dotted line is for samples drawn at p^^\ 
the value for the target 1. The vertical line is the midpoint value Omid that maximizes the 
fraction correct fc, i.e., all output values < Omid are identified as being drawn at while 
the rest at p^-^\ (b) The output of the same network, but for samples drawn at parameters 
intermediate to p^^^ and p^^\ The solid line is for p close to p^^^ and the dotted line for a 
choice close to p^^\ (c) Solid line is the average truth value t for the p^^^ patterns, averaged 
over a committee of 50 networks, the only difference between the networks being the initial 
randomization of the weights. The dotted line is for p'^^K (d) The average truth value for 
the same sets of samples in (b) . The averaging has produced two well defined peaks that are 
cleanly separated. Distributions of t hke those in (c) and (d) become the basis for predicting 
which estimated parameters to associate with the average truth values, the ouput due to 
presenting a committe of networks with an unknown pattern. 
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Fig. 4. — Neural net results from noisy time series, (a) Sample sine wave for coin — 5.60, 
including 10 regularly spaced gaps in 9, along with 24 randomly placed gaps, (b) Same 
sample sine wave, but now including Gaussian noise so that the resulting input network 

pattern has a S/N=0.4 (c) Histogram of fitted oOohs ior an initial network training range of 
1^(0) = 4,Q and uj^-^'^ = 8.0. (d) Fitted values after second iteration. Note that the training 
values now bracket the distribution returned by the first iteration. Vertical dashed lines 
show the training frequencies, (e) Fitted values after third (final) iteration. The width of 
the distribution matches the separation of training values. Note that we correctly recover 
the input frequency, a;obs = 5.56±0.11, even though none of the networks were trained at 
this value. 



-28- 



50 



Mean Estimated Value = 1.30 



40 



30 



20 



10 







-True Value = 1.40 



1 



0.0 



0.5 1.0 1.5 

Fitted spectral index n 



2.0 



Fig. 5. — Fitted spectral index riobs derived from 1000 realizations of CMB anisotropy sky 
maps with = 1.25. The dotted line is for an initial training range of = 0.5 and 
n^^) = 1.5 while the sohd line is the distribution for the final range of n^^^ — 0.8 and 
n^^^ = 1.4. The fitted values correctly peak at the input value (vertical sohd line), despite 
never having trained on this parameter value. 
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Fig. 6. — Toy models showing changes in the density enhancement A in redshift space due 
to pecuhar velocities (the "bull's-eye effect"). The same neural nets developed for CMB sky 
maps successfully discriminate at 90% confidence among all models shown. 
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Fig. 7. — CPU cost scaling (a) Scaling of computational costs for the limiting cases of 
irregularly sampled noisy sine wave. We consider two cases: i) the S/N per pixel is fixed 
(diamonds/solid line); and ii) the S/N for the entire pattern is fixed and the noise per pixel 
grows with (circles/dashed line). These limiting cases are well described by power-law 
fits Ai'cpu ~ ^input with a = 1.0 and 2.5, respectively, (b) Scaling of computational costs 
for CMB anisotropy. Working at a fixed sky map resolution, we vary the patch size that 
is examined. This holds the S/N per pixel fixed, but new information is introduced as the 
patch size increases. The solid fine represents a power law fit of A^cpu ~ ^d^- 



